Random Vectors and the Gauss-Markov Theorem

Dr. Lucy D’Agostino McGowan

Properties of Random Vectors

What is a Random Vector?

Definition: A collection of random variables in a vector
\[\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}\]

Expected Value Properties

Linearity property:
\[E[\mathbf{A}\mathbf{Y} + \mathbf{b}] = \mathbf{A}E[\mathbf{Y}] + \mathbf{b}\]

Why this works: Expectation distributes over linear combinations

Key assumption: \(\mathbf{A}\) is constant, not random
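A quick simulation sketch of this property (the values of \(\mathbf{A}\), \(\mathbf{b}\), and the mean of \(\mathbf{Y}\) below are arbitrary choices for illustration):

```r
# Check E[AY + b] = A E[Y] + b by simulation
# (mu, A, b are made-up values for illustration)
set.seed(1)
mu <- c(1, 2)                       # E[Y]
A  <- matrix(c(2, 0, 1, 3), 2, 2)   # constant matrix
b  <- c(5, -1)                      # constant shift

# 100,000 draws of Y: independent normals with mean mu
Y <- cbind(rnorm(1e5, mu[1]), rnorm(1e5, mu[2]))

# Transform each draw, then compare the empirical mean to the formula
AYb <- t(A %*% t(Y) + b)
colMeans(AYb)    # approximately...
A %*% mu + b     # ...the theoretical value
```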

Variance-Covariance Matrix

Definition:
\[\text{Var}(\mathbf{Y}) = E[(\mathbf{Y} - E[\mathbf{Y}])(\mathbf{Y} - E[\mathbf{Y}])^T]\]

This creates an \(n \times n\) matrix

Structure of Var-Cov Matrix

\[\text{Var}(\mathbf{Y}) = \begin{bmatrix} \text{Var}(Y_1) & \text{Cov}(Y_1, Y_2) & \cdots \\ \text{Cov}(Y_2, Y_1) & \text{Var}(Y_2) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}\]

Diagonal: individual variances
Off-diagonal: covariances between pairs
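A simulation sketch of this structure, using the same kind of \(2 \times 2\) example that appears in the exercise below (the specific \(\boldsymbol{\Sigma}\) is an arbitrary choice):

```r
# Sample var-cov matrix of simulated bivariate data
library(MASS)                            # for mvrnorm()
set.seed(8)
Sigma <- matrix(c(4, 1, 1, 9), 2, 2)     # true var-cov matrix
Y <- mvrnorm(1e5, mu = c(0, 0), Sigma = Sigma)

cov(Y)   # diagonal: variances (near 4 and 9); off-diagonal: covariance (near 1)
```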

Variance Properties

Key transformation rule:
\[\text{Var}(\mathbf{A}\mathbf{Y} + \mathbf{b}) = \mathbf{A}\text{Var}(\mathbf{Y})\mathbf{A}^T\]

Note: Constants \(\mathbf{b}\) don’t affect variance! But constant \(\mathbf{A}\) does.

Variance Intuition

Think of it as: \((\mathbf{A} \times \text{variability} \times \mathbf{A}^T)\)

Matrix \(\mathbf{A}\) transforms the variables
Variance gets “stretched” by \(\mathbf{A}\) on both sides

You Try: Simple Variance

Given: \(\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix}\) with \(\text{Var}(\mathbf{Y}) = \begin{bmatrix} 4 & 1 \\ 1 & 9 \end{bmatrix}\)

Find: \(\text{Var}(2Y_1 + 3Y_2)\)


You Try: Setup

Express as: \(\mathbf{A}\mathbf{Y}\) where \(\mathbf{A} = [2, 3]\)

Apply formula:
\[\text{Var}(2Y_1 + 3Y_2) = \mathbf{A}\text{Var}(\mathbf{Y})\mathbf{A}^T\]

You Try: Calculation

\[[2, 3] \begin{bmatrix} 4 & 1 \\ 1 & 9 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \end{bmatrix}\]

\[= [2, 3] \begin{bmatrix} 11 \\ 29 \end{bmatrix} = 109\]

Verification

# Using matrix formula
Sigma <- matrix(c(4, 1, 1, 9), 2, 2)
A <- matrix(c(2, 3), 1, 2)
A %*% Sigma %*% t(A)
     [,1]
[1,]  109

Simulation Check

# Verify by simulation
set.seed(919)
library(MASS)
Y_sim <- mvrnorm(100000, mu = c(0, 0), Sigma = Sigma)
linear_combo <- 2 * Y_sim[,1] + 3 * Y_sim[,2]
var(linear_combo)
[1] 108.8609

The Gauss-Markov Theorem

Linear Regression Model

The model:
\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]

We estimate \(\boldsymbol{\beta}\) by “ordinary least squares” (OLS)

The Big Question

Among all linear, unbiased estimators of \(\boldsymbol{\beta}\)

Which one has the smallest variance?

Gauss-Markov Assumptions

Assumption 1: Linearity
The model is \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\)

Assumption 2: Zero mean errors
\(E[\boldsymbol{\varepsilon}] = \mathbf{0}\)

More GM Assumptions

Assumption 3: Constant variance & uncorrelated errors
\(\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}\)

Assumption 4: Full rank
\(\mathbf{X}\) has full column rank (no perfect multicollinearity)

What is Assumption 3?

Homoscedasticity: All errors have same variance \(\sigma^2\)
\(\text{Var}(\varepsilon_i) = \sigma^2 \text{ for all } i\)

Uncorrelated errors: errors are pairwise uncorrelated (weaker than full independence)
\(\text{Cov}(\varepsilon_i, \varepsilon_j) = 0 \text{ for } i \neq j\)
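Putting the two conditions together, the error variance-covariance matrix is diagonal with a common value on the diagonal:

\[\text{Var}(\boldsymbol{\varepsilon}) = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2\mathbf{I}\]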

Linear Unbiased Estimators

Linear estimator: \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\)
where \(\mathbf{C}\) doesn’t depend on \(\mathbf{y}\)

Unbiased: \(E[\tilde{\boldsymbol{\beta}}] = \boldsymbol{\beta}\)

Examples of Linear Estimators

OLS estimator:
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

More generally, any estimator of the form \(\mathbf{C}\mathbf{y}\) with a fixed weight matrix \(\mathbf{C}\)
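A small sketch showing the OLS estimator in the \(\mathbf{C}\mathbf{y}\) form (the design matrix here is a made-up example):

```r
# OLS written explicitly as a linear estimator C y
X <- cbind(1, 1:10)                  # made-up design: intercept + predictor
C <- solve(t(X) %*% X) %*% t(X)      # C = (X'X)^{-1} X', depends only on X

# C is a fixed 2 x 10 weight matrix, so beta-hat = C y is linear in y
dim(C)
round(C %*% X, 10)                   # C X = I (the identity matrix)
```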

The Gauss-Markov Theorem

Theorem: Under GM assumptions, OLS is BLUE

BLUE = Best Linear Unbiased Estimator
“Best” = smallest variance

Proof Overview

Step 1: Show OLS is unbiased

Step 2: Find variance of OLS

Step 3: Show any other linear unbiased estimator has larger variance

Proving OLS is Unbiased

Starting Point

We want to show: \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\)

Start with the OLS formula:
\[E[\hat{\boldsymbol{\beta}}] = E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}]\]

Substitute the Model

Replace \(\mathbf{y}\) with \(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\):

\[E[\hat{\boldsymbol{\beta}}] = E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon})]\]

Distribute the Matrix

Multiply through: \[= E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}]\]

Simplify the First Term

Note that: \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X} = \mathbf{I}\)

So we get: \[= E[\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}]\]

Use Linearity of Expectation

Expectation of a sum = sum of expectations: \[= \boldsymbol{\beta} + E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}]\]

Important Note: β as Fixed Parameter

In classical regression: \(\beta\) is a fixed but unknown parameter

Randomness comes from: \(\varepsilon\) (and therefore y), not from \(\beta\)

This is why: we can pull \(\beta\) out of expectations like a constant

Move Constants Out

Since \(\mathbf{X}\) is fixed (not random): \[= \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\boldsymbol{\varepsilon}]\]

Apply Zero Mean Assumption

From Assumption 2: \(E[\boldsymbol{\varepsilon}] = \mathbf{0}\)

Therefore: \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \mathbf{0} = \boldsymbol{\beta}\)

Unbiasedness: Complete ✓

We have shown: \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\)

OLS is unbiased under GM assumptions
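A simulation sketch of this result: regenerate the errors many times and average \(\hat{\boldsymbol{\beta}}\) (the design, \(\boldsymbol{\beta} = (1, 2)\), and the error SD are arbitrary choices for illustration):

```r
# Unbiasedness check: average beta-hat over many error draws
set.seed(3)
X    <- cbind(1, 1:20)              # made-up design matrix
beta <- c(1, 2)                     # "true" coefficients for the simulation
Cmat <- solve(t(X) %*% X) %*% t(X)  # (X'X)^{-1} X'

beta_hats <- replicate(5000, {
  y <- X %*% beta + rnorm(20, sd = 2)   # fresh errors each replicate
  as.vector(Cmat %*% y)
})
rowMeans(beta_hats)   # close to the true beta = (1, 2)
```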

Finding the Variance of OLS

You Try: OLS Variance

Calculate \(\text{Var}(\hat{\boldsymbol{\beta}})\) where:
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]


Hint: Express in Terms of Errors

From our unbiasedness proof, we found: \[\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}\]

Use Variance Properties

Since \(\boldsymbol{\beta}\) is constant (has zero variance): \[\text{Var}(\hat{\boldsymbol{\beta}}) = \text{Var}((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon})\]

Apply Matrix Variance Formula

Let \(\mathbf{A} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\), then: \[\text{Var}(\mathbf{A}\boldsymbol{\varepsilon}) = \mathbf{A}\text{Var}(\boldsymbol{\varepsilon})\mathbf{A}^T\]

Substitute Error Variance

From Assumption 3: \(\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}\)

Therefore: \[= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \cdot \sigma^2\mathbf{I} \cdot \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\]

Simplify

Pull out the scalar \(\sigma^2\): \[= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\]

Final answer: \[\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\]
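A simulation sketch comparing the empirical covariance of \(\hat{\boldsymbol{\beta}}\) with \(\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\) (the design, \(\boldsymbol{\beta}\), and \(\sigma\) are made-up values for illustration):

```r
# Check Var(beta-hat) = sigma^2 (X'X)^{-1} by simulation
set.seed(4)
X      <- cbind(1, 1:20)            # made-up design matrix
beta   <- c(1, 2)
sigma  <- 2
XtXinv <- solve(t(X) %*% X)

beta_hats <- replicate(5000, {
  y <- X %*% beta + rnorm(20, sd = sigma)
  as.vector(XtXinv %*% t(X) %*% y)
})

cov(t(beta_hats))      # empirical var-cov of beta-hat
sigma^2 * XtXinv       # theoretical value
```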

Proving OLS Has Minimum Variance

The Challenge

Goal: Show any other linear unbiased estimator has larger variance

Strategy: Consider any linear unbiased estimator \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\)

Unbiasedness Constraint

For \(\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\) to be unbiased (using \(E[\boldsymbol{\varepsilon}] = \mathbf{0}\)): \[E[\mathbf{C}\mathbf{y}] = E[\mathbf{C}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon})] = \mathbf{C}\mathbf{X}\boldsymbol{\beta}\]

The Key Constraint

This must equal \(\boldsymbol{\beta}\) for any value of \(\boldsymbol{\beta}\)

Therefore we need: \(\mathbf{C}\mathbf{X} = \mathbf{I}\)

Clever Decomposition

Write any unbiased \(\mathbf{C}\) as: \(\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}\)

where \(\mathbf{D} = \mathbf{C} - (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) is the deviation from the OLS weights

Why This Decomposition is Clever

Intuition: Any estimator = OLS + some deviation

Key insight: The deviation \(\mathbf{D}\) can only add variance, never reduce it

Mathematical power: Separates what we know (OLS) from the unknown part

Why This Decomposition Works

Check the constraint \(\mathbf{C}\mathbf{X} = \mathbf{I}\): \([(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}]\mathbf{X} = \mathbf{I} + \mathbf{D}\mathbf{X}\)

Constraint on D

For the constraint to hold, we need: \[\mathbf{D}\mathbf{X} = \mathbf{0}\]

This is the key restriction on \(\mathbf{D}\)
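As a concrete illustration (an added example, not part of the proof): in the intercept-only model \(\mathbf{X} = \mathbf{1}_n\), OLS is the sample mean, and any alternative linear unbiased estimator is a weighted average:

\[\hat{\beta} = \bar{y} = \tfrac{1}{n}\mathbf{1}^T\mathbf{y}, \qquad \tilde{\beta} = \mathbf{w}^T\mathbf{y}, \qquad \mathbf{D} = \mathbf{w}^T - \tfrac{1}{n}\mathbf{1}^T\]

Here the constraint \(\mathbf{D}\mathbf{X} = \mathbf{0}\) reduces to \(\sum_i w_i - 1 = 0\): the weights of any unbiased weighted average must sum to one.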

Comparing Variances

Variance of Alternative Estimator

Start with: \[\text{Var}(\tilde{\boldsymbol{\beta}}) = \text{Var}(\mathbf{C}\mathbf{y}) = \mathbf{C}\text{Var}(\mathbf{y})\mathbf{C}^T\]

Substitute Error Variance

Since \(\mathbf{X}\boldsymbol{\beta}\) is constant, \(\text{Var}(\mathbf{y}) = \text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}\): \[= \sigma^2\mathbf{C}\mathbf{C}^T\]

Substitute Our Decomposition

Replace \(\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}\): \[\mathbf{C}\mathbf{C}^T = [(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}][(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}]^T\]

Expand the Product

This gives us four terms: \[= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{D}^T\] \[+ \mathbf{D}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T\]

Cross Terms Vanish

Since \(\mathbf{D}\mathbf{X} = \mathbf{0}\):

  • \(\mathbf{D}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \mathbf{0}\)
  • \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{D}^T = (\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{D}\mathbf{X})^T = \mathbf{0}\)

Simplified Result

After cancellation: \[\mathbf{C}\mathbf{C}^T = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T\]

Which simplifies to: \[= (\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T\]

The Final Comparison

Therefore: \[\text{Var}(\tilde{\boldsymbol{\beta}}) = \sigma^2[(\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T]\]

Key Mathematical Fact

\(\mathbf{D}\mathbf{D}^T\) is positive semi-definite

This means: \(\mathbf{a}^T\mathbf{D}\mathbf{D}^T\mathbf{a} = \|\mathbf{D}^T\mathbf{a}\|^2 \geq 0\) for every vector \(\mathbf{a}\), written \(\mathbf{D}\mathbf{D}^T \geq \mathbf{0}\) (in the matrix sense)

Conclusion: OLS is Best

Since \(\mathbf{D}\mathbf{D}^T \geq \mathbf{0}\): \[\text{Var}(\tilde{\boldsymbol{\beta}}) \geq \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} = \text{Var}(\hat{\boldsymbol{\beta}})\]
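A numeric sketch of the whole comparison (the design matrix and the random \(\mathbf{D}\) are made-up for illustration): build a \(\mathbf{D}\) with \(\mathbf{D}\mathbf{X} = \mathbf{0}\) and check that the extra variance term \(\mathbf{D}\mathbf{D}^T\) is positive semi-definite.

```r
# Build an alternative linear unbiased estimator and compare variances
set.seed(5)
X     <- cbind(1, 1:5)                             # made-up design
C_ols <- solve(t(X) %*% X) %*% t(X)

# Rows orthogonal to col(X) give D X = 0: project a random matrix
# onto the orthogonal complement of the column space of X
M <- diag(5) - X %*% solve(t(X) %*% X) %*% t(X)    # annihilator matrix
D <- matrix(rnorm(10), 2, 5) %*% M
round(D %*% X, 10)                                 # the zero matrix

# Extra variance of the alternative estimator C = C_ols + D
C     <- C_ols + D
extra <- C %*% t(C) - C_ols %*% t(C_ols)           # equals D D' (cross terms vanish)
eigen(extra)$values                                # nonnegative: PSD
```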

Proof Complete: OLS is BLUE

We have shown:

  • OLS is linear: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)
  • OLS is unbiased: \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\)

And OLS has minimum variance among all linear unbiased estimators